
    Self-Supervised and Controlled Multi-Document Opinion Summarization

    We address the problem of unsupervised abstractive summarization of collections of user-generated reviews with self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on the standard log-likelihood loss. We address the problem of hallucinations through the use of control codes, which steer the generation towards more coherent and relevant summaries. Finally, we extend the Transformer architecture to allow for multiple reviews as input. Our benchmarks on two datasets against graph-based and recent neural abstractive unsupervised models show that our proposed method generates summaries of superior quality and relevance. This is confirmed by our human evaluation, which focuses explicitly on the faithfulness of generated summaries. We also provide an ablation study, which shows the importance of the control setup in limiting hallucinations and in achieving high sentiment and topic alignment between the summaries and the input reviews. Comment: 18 pages including 5 pages of appendix.
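The leave-one-out construction of training pairs described above can be sketched in a few lines: each review serves as the pseudo-summary target for its most similar neighbors. This is an illustrative sketch only; the function names and the use of token-level Jaccard similarity are assumptions, not the paper's actual implementation.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two reviews."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def make_pairs(reviews, k=3):
    """For each review, treat it as the pseudo-summary target and use
    its k most similar other reviews as the multi-document input."""
    pairs = []
    for i, target in enumerate(reviews):
        scored = sorted(
            ((jaccard(target, r), j) for j, r in enumerate(reviews) if j != i),
            reverse=True,
        )
        inputs = [reviews[j] for _, j in scored[:k]]
        pairs.append((inputs, target))
    return pairs
```

Each `(inputs, target)` pair can then be fed to a standard sequence-to-sequence model trained with log-likelihood loss.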

    Algorithms for the Efficient Search of Similar Instances (Algoritmos para la búsqueda eficiente de instancias similares)

    Thesis (Lic. in Computer Science)--Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía y Física, 2007. In this work we take on the challenge of searching for similar objects within a very large collection. The problem poses two difficulties: first, defining a similarity measure between two objects, and then implementing an algorithm that, based on that measure, efficiently finds the objects that are sufficiently alike. The presented solution uses a measure strongly based on the concepts of precision and recall, yielding a measure similar to Jaccard's. The efficiency of the algorithm lies in first generating groups of similar objects, and only afterwards looking these objects up in the database. We applied this algorithm in two settings: on the one hand, to a database of users who rate movies, in order to predict those ratings; on the other, to find genetic profiles that may have contributed to a piece of genetic evidence. Author: Matthias Gallé.

    Searching for Smallest Grammars on Large Sequences and Application to DNA

    Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (by up to 10%) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms.
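The second of the two subproblems, finding a minimal parsing once the constituents are fixed, admits a simple dynamic-programming sketch: at each position, either emit a single character or close a whole constituent word. This is an illustrative reconstruction under that reading of the abstract, not the paper's algorithm, and the function name is an assumption.

```python
def minimal_parsing(seq, constituents):
    """Fewest symbols needed to cover seq, where a symbol is either
    a single character or one occurrence of a constituent word."""
    n = len(seq)
    best = [0] + [float("inf")] * n  # best[i] = optimal cost of seq[:i]
    for i in range(1, n + 1):
        best[i] = best[i - 1] + 1    # emit seq[i-1] as a single character
        for w in constituents:
            if len(w) <= i and seq[i - len(w):i] == w:
                best[i] = min(best[i], best[i - len(w)] + 1)
    return best[n]
```

For example, with constituents `["ab"]` the string `"abab"` parses into two symbols instead of four characters.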

    In-place Update of Suffix Array while Recoding Words

    Motivated by grammatical inference and data compression applications, we propose an algorithm to update a suffix array after the substitution, in the indexed text, of some occurrences of a given word by a new character. Compared to other published index update methods, the problem addressed here may require the modification of a large number of distinct positions over the original text. The proposed algorithm uses the specific internal order of suffix arrays to update groups of entries simultaneously, and ensures that only entries that need to be modified are visited. Experiments confirm a significant execution-time speed-up compared to rebuilding the suffix array from scratch at each step of the application.
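The operation being indexed, and the from-scratch baseline the paper compares against, can be sketched as follows: occurrences of a word are recoded as a single new character, and the suffix array is then rebuilt. The in-place update algorithm itself is what the paper contributes and is not shown here; the function names and the naive construction are assumptions for illustration.

```python
def recode(text, word, symbol):
    """Replace every occurrence of `word` with the single new character `symbol`."""
    return text.replace(word, symbol)

def suffix_array(text):
    """Naive from-scratch construction: sort all suffix start positions
    lexicographically. This is the baseline rebuilt at each step."""
    return sorted(range(len(text)), key=lambda i: text[i:])
```

Rebuilding after every recoding step costs a full construction each time, which is the redundancy the proposed in-place update avoids.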

    BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model

    The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the largest multilingual language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of research publications spanning topics from ethics and law to data governance, modeling choices, and distributed training. This paper focuses on the collaborative research aspects of BigScience and takes a step back to look at the challenges of large-scale participatory research, with respect to participant diversity and the tasks required to successfully carry out such a project. Our main goal is to share the lessons we learned from this experience, what we could have done better, and what we did well. We show how the impact of such a social approach to scientific research goes well beyond the technical artifacts that were the basis of its inception. Comment: Presented at the 2022 NeurIPS Workshop on Broadening Research Collaborations in ML.